arXiv:1502.04154v1 [cs.SI] 14 Feb 2015


Extracting Hidden Groups and their Structure from Streaming Interaction Data


Mark K. Goldberg Mykola Hayvanovych Malik Magdon-Ismail 

goldberg@cs.rpi.edu hayvam@cs.rpi.edu magdon@cs.rpi.edu 

William A. Wallace 
wallaw@rpi.edu 

Rensselaer Polytechnic Institute, 

110 8th Street, Troy, NY 12180, USA. 

{goldberg,hayvam,magdon}@cs.rpi.edu; wallaw@rpi.edu. 

February 17, 2015 


Abstract 

When actors in a social network interact, it usually means they have some general goal 
towards which they are collaborating. This could be a research collaboration in a company or a 
foursome planning a golf game. We call such groups planning groups. In many social contexts, 
it might be possible to observe the dyadic interactions between actors, even if the actors do 
not explicitly declare what groups they belong to. When groups are not explicitly declared,
we call them hidden groups. Our particular focus is hidden planning groups. By virtue of 
their need to further their goal, the actors within such groups must interact in a manner which 
differentiates their communications from random background communications. In such a case, 
one can infer (from these interactions) the composition and structure of the hidden planning 
groups. We formulate the problem of hidden group discovery from streaming interaction data, 
and we propose efficient algorithms for identifying the hidden group structures by isolating the 
hidden group’s non-random, planning-related, communications from the random background 
communications. We validate our algorithms on real data (the Enron email corpus and Blog 
communication data). Analysis of the results reveals that our algorithms extract meaningful 
hidden group structures. 


1 Introduction 

Communication networks (telephone, email, Internet chatroom, etc.) facilitate rapid information 
exchange among millions of users around the world, providing the ideal environment for groups 
to plan their activity undetected: their communications are embedded (hidden) within the myriad 
of unrelated communications. A group may communicate in a structured way while not being 
forthright about its existence. However, when the group must exchange communications to plan 
some activity, their need to communicate usually imposes some structure on their communications.
We develop statistical and algorithmic approaches for discovering such hidden groups that
plan an activity. Hidden group members may have non-planning related communications, be
malicious (e.g. a terrorist group) or benign (e.g. a golf foursome). We liberally use “hidden group”
for all such groups involved in planning, even though they may not intentionally be hiding their 
communications.

(a) Stream with message content:

Time  Pair    Message
00    A -> C  Golf tomorrow? Tell everyone.
05    C -> F  Alice mentioned golf tomorrow.
06    A -> B  Hey, golf tomorrow? Spread the word.
12    A -> B  Tee time: 8am; Place: Pinehurst.
13    F -> G  Hey guys, golf tomorrow.
13    F -> H  Hey guys, golf tomorrow.
15    A -> C  Tee time: 8am; Place: Pinehurst.
20    B -> D  We’re playing golf tomorrow.
20    B -> E  We’re playing golf tomorrow.
22    C -> F  Tee time: 8am; Place: Pinehurst.
25    B -> D  Tee time: 8am; Place: Pinehurst.
25    B -> E  Tee time 8am, Pinehurst.
31    F -> G  Tee time 8am, Pinehurst.
31    F -> H  Tee off 8am, Pinehurst.

(b) The same stream with the Message column removed (Time and Pair only).


Figure 1: Streaming hidden group with two waves of planning (a). Streaming group without 
message content - only time, sender id and receiver id are available (b). 


The tragic events of September 11, 2001 underline the need for a tool which aids in the
discovery of hidden groups during their planning stage, before implementation. One approach 
to discovering such groups is using correlations among the group member communications. The 
communication graph of the society is defined by its actors (nodes) and communications (edges). 
We do not use communication content, even though it can be informative through some natural 
language processing, because such analysis is time consuming and intractable for large datasets. 
We use only the time-stamp, sender and recipient ID of a message. 
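To make the input concrete, the following is a minimal sketch of how a content-free stream could be held in memory. The records are transcribed from Figure 1(b); the name `build_time_lists` and the dictionary layout are illustrative assumptions, not structures taken from the paper.

```python
from collections import defaultdict

# (time, sender, receiver) records transcribed from Figure 1(b).
STREAM = [
    (0, "A", "C"), (5, "C", "F"), (6, "A", "B"), (12, "A", "B"),
    (13, "F", "G"), (13, "F", "H"), (15, "A", "C"), (20, "B", "D"),
    (20, "B", "E"), (22, "C", "F"), (25, "B", "D"), (25, "B", "E"),
    (31, "F", "G"), (31, "F", "H"),
]


def build_time_lists(stream):
    """Group message times by directed (sender, receiver) pair, sorted ascending."""
    lists = defaultdict(list)
    for time, sender, receiver in stream:
        lists[(sender, receiver)].append(time)
    for times in lists.values():
        times.sort()
    return dict(lists)


time_lists = build_time_lists(STREAM)
```

Each directed pair then carries one sorted list of time stamps, which is the only information the later matching steps need.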

Our approach to discovering hidden groups is based on the observation that the pattern of
communications exhibited by a group pursuing a common objective is different from that of a 
randomly selected set of actors: any group, even one which tries to hide itself, must communicate 
regularly to plan. One possible instance of such correlated communication is the occurrence of a 
repeated communication pattern. Temporal correlation emerges as the members of a group need to 
systematically exchange messages to plan their future activity. This correlation among the group 
communications will exist throughout the planning stage, which may be some extended period of 
time. If the planning occurs over a long enough period, this temporal correlation will stand out 
against a random background of communications and hence can be detected. 

2 Streaming Hidden Groups 

Unlike the cyclic hidden group setting [9], in which all of the hidden group members communicate
within some characteristic time period, and do so repeatedly over a consecutive sequence of time
periods, a streaming hidden group does not obey such strict requirements for its communication
pattern. Hidden groups do not necessarily display a fixed time-cycle during which all group
members exchange messages; rather, whenever a step in the planning needs to occur, some
hidden group member initiates a communication, which percolates through the hidden group. The
hidden group problem may still be formulated as one of finding repeated (possibly overlapping)
communication patterns. An example of a streaming hidden group is illustrated in Fig. 1(a), with
the same group planning a golf game. Given the message content, it is easy to identify two “waves”
of communication. The first wave (in darker font) establishes the golf game; the second wave (in
lighter font) finalizes the game details. Based on this data, it is not hard to identify the group
and conclude that the “organizational structure” of the group is represented in Fig. 2 to the right
(each actor is represented by their first initial). The challenge, once again, is to deduce this same
information from the communication stream without the message contents, Fig. 1(b). Two features
that distinguish the stream from the cycle model are:

(i) communication waves may overlap, as in Fig. 1(a);

(ii) waves may have different durations, some considerably longer than others. 

The first feature may result in bursty waves of intense communication (many overlapping waves) 
followed by periods of silence. Such a type of communication dynamics is hard to detect in the 
cycle model, since all the (overlapping) waves of communication may fall in one cycle. The second 
can be quantified by a propagation delay function which specifies how much time may elapse 
between a hidden group member receiving the message and forwarding it to the next member; 
sometimes the propagation delays may be large, and sometimes small. One would typically expect 
that such a streaming model would be appropriate for hidden groups with
some organizational structure as illustrated in the tree in Fig. 2. We present
algorithms which discover the streaming hidden group and its organizational
structure without the use of message content.

We use the notion of communication frequency in order to distinguish non-random
behavior. Thus, if a group of actors communicates unusually often
using the same chain of communication, i.e. the structure of their communications
persists through time, then we consider this group to be statistically
significant and indicative of a hidden group. We present algorithms to detect 
small frequent tree-like structures, and build hidden structures starting from 
the small ones. 

3 Our Contributions 

We present efficient algorithms which not only discover the streaming hidden group, but also its 
organizational structure without the use of message content. We use the notion of communication
frequency in order to distinguish non-random behavior. Thus, if a group of actors communicates 
unusually often using the same chain of communication, i.e. the structure of their communications 
persists through time, then we consider this group to be statistically anomalous. We present 
algorithms to detect small frequent tree-like structures, and build hidden structures starting from 
the small ones. We also propose an approach that uses new cluster matching algorithms together 
with a sliding window technique to track and observe the evolution of hidden groups over time. We 
also present a general query algorithm which can determine if a given hidden group (represented as 
a tree) occurs frequently in the communication stream. Additionally we propose efficient algorithms 
to obtain the frequency of general trees and to enumerate all statistically significant general trees of a 
specified size and frequency. Such algorithms are used in conjunction with the heuristic algorithms 
and similarity measure techniques to verify that a discovered tree-like structure actually occurs 
frequently in the data. We validate our algorithms on the Enron email corpus, as well as the Blog 
communication data. 

Paper Organization. First we consider related work, followed by the methodologies for the 
streaming hidden groups and tree mining in Section 5. Next we present similarity measure methods
in Section 10. We present experiments on real world data and validation results in Section 10,
followed by the summary and conclusions in Section 11.



Figure 2: Group structure in Fig. 1.

4 Related Work


Identifying structure in networks has been studied extensively in the context of clustering and 
partitioning (see for example [a 0 0 [12 la [III [Ill [HI [201 [221 [Ml [Ml [la 133 These approaches focus 
on static, non-planning, hidden groups. In [28], hidden Markov models are the basis for discovering
planning hidden groups. The underlying methodology is based on random graphs nnuM] and some 
of the results on cyclic hidden groups were presented in [9]. In our work we incorporate some of 
the prevailing social science theories, such as homophily ([29]), by incorporating group structure. 
More models of societal evolution and simulation can be found in [isiiisiiiniliiEsiEiiisaisi] 
which deal with dynamic models for social network infrastructure, rather than the dynamics of the 
actual communication behavior. 

Our work is novel because we detect hidden groups by only analyzing communication intensities 
(and not message content). The study of streaming hidden groups was initiated in [6], which
contains some preliminary results. We extend these results and present a general query algorithm
which can determine whether a given hidden group (represented as a tree) occurs frequently in the
communication stream, and which we extend to algorithms that obtain the frequency of general trees and enumerate
all statistically significant general trees of a specified size and frequency. Such algorithms are used 
in conjunction with the heuristic algorithms and similarity measures to verify that a discovered 
tree-like structure actually occurs frequently in the data. 

Erickson, m, was one of the first to study secret societies. His focus was on general communication
structure. Since the September 11, 2001 terrorist plot, discovering hidden groups became a
topic of intense research. For example it was understood that Mohammed Atta was central to the 
planning, but that a large percent of the network would need to be removed to render it inoperable 
[MlllZ]. Krebs, m identified the network as sparse, which renders it hard to discover through 
clustering in the traditional sense (finding dense subsets). Our work on temporal correlation would 
address exactly such a situation. It has also been observed that terrorist group structure may be 
changing [33], and our methods are based on connectivity which is immune to this trend. We assume
that message authorship is known, which may not be true; Abbasi and Chen propose techniques
to address this issue [2].

5 Problem Statement 

A hidden group communication structure can be represented by a directed graph. Each vertex is 
an actor and every edge shows the direction of the communication. For example a hierarchical 
organization structure could be represented by a directed tree. The graph in Figure 3 to the right
is an example of a communication structure, in which actor $A$ “simultaneously” sends messages
to $B$ and $C$; then, after receiving the message from $A$, $B$ sends messages to $C$ and $D$; $C$ sends
a message to D after receiving the messages from A and B. Every graph has two basic types of 
communication structures: chains and siblings. A chain is a path of length at least 3, and a sibling 
is a tree with a root and two or more children, but no other nodes. Of particular interest are chains 
and sibling trees with three nodes, which we denote triples. For example, the chains and sibling 
trees of size three (triples) in the communication structure above are: $A \to B \to D$; $A \to B \to C$;
$A \to C \to D$; $B \to C \to D$; $A \to \{B, C\}$; and $B \to \{C, D\}$. We suppose that a hidden group
employs a communication structure that can be represented by a directed graph as above. If the 
hidden group is hierarchical, the communication graph will be a tree. The task is to discover such 
a group and its structure based solely on the communication data. 
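As a worked illustration of the six triples listed for the example graph, the sketch below enumerates chain and sibling triples from an edge list. The edge list is read off the description of the example graph; the function names are hypothetical helpers, not part of the paper.

```python
from itertools import combinations

# Directed edges of the example communication structure.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]


def chain_triples(edges):
    """All chains x -> y -> z formed by two consecutive edges."""
    return sorted(
        (x, y, z)
        for (x, y) in edges
        for (y2, z) in edges
        if y == y2 and z != x
    )


def sibling_triples(edges):
    """All sibling triples (parent, child1, child2) with two distinct children."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    return sorted(
        (parent, c1, c2)
        for parent, kids in children.items()
        for c1, c2 in combinations(sorted(kids), 2)
    )
```

On the example edge list this yields four chains and two sibling triples, matching the six triples named in the text.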

If a communication structure appears in the data many times, then it is
likely to be non-random, and hence represent a hidden group. To discover hidden
groups, we will discover the communication structures that appear many times.

A chain triple $A \to B \to C$ appears when there are communication times $t_{AB}$ and $t_{BC}$ with
$(t_{BC} - t_{AB}) \in [\tau_{min}, \tau_{max}]$; a sibling triple $A \to \{B, C\}$ appears when there are times
$t_{AB}$ and $t_{AC}$ with $(t_{AB} - t_{AC}) \in [-\delta, \delta]$. The sibling
constraint represents the notion of $A$ sending messages “simultaneously” to
$B$ and $C$ within a small time interval of each other, as specified by $\delta$. For
an entire graph (such as the one above) to appear, every chain and sibling
triple in the graph must appear using a single set of times. For example, in
the graph example above, there must exist a set of times, $\{t_{AB}, t_{AC}, t_{BC}, t_{BD}, t_{CD}\}$, which satisfies
all six chain and sibling constraints: $(t_{BD} - t_{AB}) \in [\tau_{min}, \tau_{max}]$, $(t_{BC} - t_{AB}) \in [\tau_{min}, \tau_{max}]$,
$(t_{CD} - t_{AC}) \in [\tau_{min}, \tau_{max}]$, $(t_{CD} - t_{BC}) \in [\tau_{min}, \tau_{max}]$, $(t_{AB} - t_{AC}) \in [-\delta, \delta]$ and
$(t_{BC} - t_{BD}) \in [-\delta, \delta]$.

Figure 3: Target group.
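The requirement that a single set of times satisfy every chain and sibling constraint can be checked mechanically. The sketch below does this for the five-edge example graph; the function name, the dictionary keys and the sample numbers are illustrative assumptions, and the sibling pair for $B \to \{C, D\}$ is inferred from the list of triples.

```python
def satisfies_example_graph(t, tau_min, tau_max, delta):
    """Check one candidate appearance of the example graph.

    `t` maps an edge label ("AB", "AC", "BC", "BD", "CD") to the time of
    that communication.
    """
    # (earlier edge, later edge) for the four chain triples.
    chain_pairs = [("AB", "BD"), ("AB", "BC"), ("AC", "CD"), ("BC", "CD")]
    # Edge pairs sharing a sender, for the two sibling triples.
    sibling_pairs = [("AB", "AC"), ("BC", "BD")]
    chains_ok = all(
        tau_min <= t[later] - t[earlier] <= tau_max
        for earlier, later in chain_pairs
    )
    siblings_ok = all(abs(t[x] - t[y]) <= delta for x, y in sibling_pairs)
    return chains_ok and siblings_ok


# Hypothetical time stamps, chosen only to exercise the check.
example_times = {"AB": 0, "AC": 1, "BC": 5, "BD": 6, "CD": 10}
```

With these made-up times the graph appears for `tau_min=2, tau_max=10, delta=1`, and stops appearing once `delta` is tightened to 0.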

A graph appears multiple times if there are disjoint sets of times each of which is an appearance of 
the graph. A set of times satisfies a graph if all chain and sibling constraints are satisfied by the set 
of times. The number of times a graph appears is the maximum number of disjoint sets of times that 
can be found, where each set satisfies the graph. Causality requires that multiple occurrences of a 
graph should monotonically increase in time. Specifically, if $t_{AB}$ “causes” $t_{BC}$ and $t'_{AB}$ “causes” $t'_{BC}$
with $t'_{AB} > t_{AB}$, then it should be that $t'_{BC} > t_{BC}$. In general, if we have two disjoint occurrences
(sets of times) $\{t_1, t_2, \ldots\}$ and $\{s_1, s_2, \ldots\}$ with $s_1 > t_1$, then it should be that $s_i > t_i$ for all
$i$. A communication structure which is frequent enough becomes statistically significant when
its frequency exceeds the expected frequency of such a structure from the random background 
communications. The goal is to find all statistically significant communication structures, which is 
formally stated in the following algorithmic problem statement. 

Input: A communication data stream and parameters: $\delta$, $\tau_{min}$, $\tau_{max}$, $h$, $k$.

Output: All communication structures of size $\ge h$, which appear at least $k$ times, where the
appearance is defined with respect to $\delta$, $\tau_{min}$, $\tau_{max}$.

Assuming we can solve this algorithmic task, the statistical task is to determine h and k to 
ensure that all the output communication structures reliably correspond to non-random “hidden 
groups”. We first consider small trees, specifically chain and sibling triples. We then develop a 
heuristic to build up larger hidden groups from clusters of triples. Additionally we mine all of the 
frequent directed acyclic graphs and propose new ways of measuring the similarity between sets of 
overlapping sets. We obtain evolving hidden groups by using a sliding window in conjunction with 
the proposed similarity measures to determine the rate of evolution. 

6 Algorithms for Chain and Sibling Trees 

We will start by introducing a technique to find chain and sibling triples, i.e. trees of type
$A \to B \to C$ (chain) and trees of type $A \to \{B, C\}$ (sibling). To accomplish this, we will enumerate all
the triples and count the number of times each triple occurs. Enumeration can be done by brute
force, i.e. considering each possible triple in the stream of communications. We have developed
a general algorithm for counting the number of occurrences of chains of length $\ell$, and siblings
of width $k$. These algorithms proceed by posing the problem as a multi-dimensional matching
problem, which in the case of triples becomes a two-dimensional matching problem. Generally
multi-dimensional matching is hard to solve, but in our case the causality constraint imposes an
ordering on the matching which allows us to construct a linear time algorithm. Finally we will
introduce a heuristic to build larger graphs from statistically significant triples using overlapping
clustering techniques [7].


6.1 Computing the Frequency of a Triple 

Consider the triple $A \to B \to C$ and the associated time lists $L_1 = \{t_1 < t_2 < \ldots < t_n\}$ and
$L_2 = \{s_1 < s_2 < \ldots < s_m\}$, where $t_i$ are the times when $A$ sent to $B$ and $s_j$ the times when $B$ sent
to $C$. An occurrence of the triple $A \to B \to C$ is a pair of times $(t_i, s_j)$ such that $(s_j - t_i) \in [\tau_{min}, \tau_{max}]$.
Thus, we would like to find the maximum number of such pairs which satisfy the causality
constraint. It turns out that the causality constraint does not affect the size of the maximum
matching, however it is an intuitive constraint in our context.

We now define a slightly more general maximum matching problem: for a pair $(t_i, s_j)$ let $f(t_i, s_j)$
denote the score of the pair.

Let $M$ be a matching $\{(t_{i_1}, s_{j_1}), (t_{i_2}, s_{j_2}), \ldots, (t_{i_k}, s_{j_k})\}$ of size $k$. We define the score of $M$ as

$$\mathrm{Score}(M) = \sum_{l=1}^{k} f(t_{i_l}, s_{j_l}).$$

The maximum matching problem is to find a matching with a maximum score. The function $f(t, s)$
captures how likely a message from $B \to C$ at time $s$ was “caused” by a message from $A \to B$ at
time $t$. In our case we are using a hard threshold function

$$f(t, s) = f(s - t) = \begin{cases} 1 & \text{if } s - t \in [\tau_{min}, \tau_{max}], \\ 0 & \text{otherwise.} \end{cases}$$

The matching problem for sibling triples is identical with the choice

$$f(t, s) = f(t - s) = \begin{cases} 1 & \text{if } t - s \in [-\delta, \delta], \\ 0 & \text{otherwise.} \end{cases}$$
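The two hard-threshold scores and the score of a matching can be written down directly. This is a small sketch under the definitions above; the function names are illustrative and the sample numbers are made up.

```python
def chain_score(t, s, tau_min, tau_max):
    """1 if the delay s - t lies in [tau_min, tau_max], else 0."""
    return 1 if tau_min <= s - t <= tau_max else 0


def sibling_score(t, s, delta):
    """1 if the two sends are within delta of each other, else 0."""
    return 1 if -delta <= t - s <= delta else 0


def matching_score(pairs, score):
    """Score of a matching: the sum of pair scores over its (t, s) pairs."""
    return sum(score(t, s) for t, s in pairs)
```

For example, with `tau_min=2` and `tau_max=10` the pair `(0, 5)` scores 1 while `(6, 20)` scores 0, so a matching made of those two pairs has score 1.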


We can generalize to chains of arbitrary length and siblings of arbitrary width as follows. Consider
time lists $L_1, L_2, \ldots, L_{\ell-1}$ corresponding to the chain $A_1 \to A_2 \to A_3 \to \cdots \to A_\ell$, where $L_i$
contains the sorted times of communications $A_i \to A_{i+1}$. An occurrence of this chain is now an
$(\ell - 1)$-dimensional matching $\{t_1, t_2, \ldots, t_{\ell-1}\}$ satisfying the constraint $(t_{i+1} - t_i) \in [\tau_{min}, \tau_{max}]$
$\forall\, i = 1, \ldots, \ell - 2$.

The sibling of width $k$ breaks down into two cases: ordered siblings which obey constraints similar
to the chain constraints, and unordered siblings. Consider the sibling tree $A_0 \to A_1, A_2, \ldots, A_k$
with corresponding time lists $L_1, L_2, \ldots, L_k$, where $L_i$ contains the times of communications
$A_0 \to A_i$. An occurrence is a matching $\{t_1, t_2, \ldots, t_k\}$. In the ordered case the constraints are
$(t_{i+1} - t_i) \in [-\delta, \delta]$. This represents $A_0$ sending communications “simultaneously” to its recipients
in the order $A_1, \ldots, A_k$. The unordered sibling tree obeys the stricter constraint $(t_i - t_j) \in
[-(k - 1)\delta, (k - 1)\delta]$, $\forall\, i, j$ pairs, $i \ne j$. This stricter constraint represents $A_0$ sending communications
to its recipients “simultaneously” without any particular order.
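The three occurrence conditions just stated (general chain, ordered sibling, unordered sibling) can be expressed as short predicates over one candidate tuple of times. This is only a sketch; the function names are hypothetical.

```python
def is_chain_occurrence(times, tau_min, tau_max):
    """Consecutive delays t[i+1] - t[i] must all lie in [tau_min, tau_max]."""
    return all(tau_min <= b - a <= tau_max for a, b in zip(times, times[1:]))


def is_ordered_sibling_occurrence(times, delta):
    """Consecutive differences t[i+1] - t[i] must all lie in [-delta, delta]."""
    return all(-delta <= b - a <= delta for a, b in zip(times, times[1:]))


def is_unordered_sibling_occurrence(times, delta):
    """Every pairwise difference must lie in [-(k-1)*delta, (k-1)*delta].

    Checking the spread max - min is equivalent to checking all pairs.
    """
    k = len(times)
    return max(times) - min(times) <= (k - 1) * delta
```

For instance, the times `[12, 10, 11]` with `delta=1` fail the ordered check (the first difference is -2) but pass the unordered one (the spread is 2, which equals `(k-1)*delta`).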

Both problems can be solved with a greedy algorithm. The detailed algorithms for arbitrary
chains and siblings are given in Figure 4. Here we sketch the algorithm for triples. Given two time
lists $L_1 = \{t_1, t_2, \ldots, t_n\}$ and $L_2 = \{s_1, s_2, \ldots, s_m\}$, the idea is to find the first valid match $(t_{i_1}, s_{j_1})$,
which is the first pair of times that obey the constraint $(s_{j_1} - t_{i_1}) \in [\tau_{min}, \tau_{max}]$, then recursively find
the maximum matching on the remaining sublists $L_1' = \{t_{i_1+1}, \ldots, t_n\}$ and $L_2' = \{s_{j_1+1}, \ldots, s_m\}$.

The case of general chains and ordered sibling trees is similar. The first valid match is defined
similarly. Every pair of entries $t_{L_i} \in L_i$ and $t_{L_{i+1}} \in L_{i+1}$ in the maximum matching must obey
the constraint $(t_{L_{i+1}} - t_{L_i}) \in [\tau_{min}, \tau_{max}]$. To find the first valid match, we begin with the match
consisting of the first time in all lists. Denote these times $t_{L_1}, t_{L_2}, \ldots, t_{L_\ell}$. If this match is valid (all
consecutive pairs satisfy the constraint) then we are done. Otherwise consider the first consecutive
pair to violate this constraint. Suppose it is $(t_{L_i}, t_{L_{i+1}})$; so either $(t_{L_{i+1}} - t_{L_i}) > \tau_{max}$ or
$(t_{L_{i+1}} - t_{L_i}) < \tau_{min}$. If $(t_{L_{i+1}} - t_{L_i}) > \tau_{max}$ ($t_{L_i}$ is too small), we advance $t_{L_i}$ to the next entry in the time
list $L_i$; otherwise $(t_{L_{i+1}} - t_{L_i}) < \tau_{min}$ ($t_{L_{i+1}}$ is too small) and we advance $t_{L_{i+1}}$ to the next entry
in the time list $L_{i+1}$. This entire process is repeated until a valid first match is found. An efficient
implementation of this algorithm is given in Figure 4. The algorithm for unordered siblings follows
a similar logic.

Algorithm Chain
  while $P_k \le \|L_k\| - 1$, $\forall k$ do
    if $(t_j - t_i) < \tau_{min}$ then
      $P_j \leftarrow P_j + 1$
    else if $(t_j - t_i) \in [\tau_{min}, \tau_{max}]$ then
      if $j = n$ then
        $(P_1, \ldots, P_n)$ is the next match
        $P_k \leftarrow P_k + 1$, $\forall k$
        $i \leftarrow 0$; $j \leftarrow 1$
      else
        $i \leftarrow j$; $j \leftarrow j + 1$
    else
      $P_i \leftarrow P_i + 1$
      $j \leftarrow i$; $i \leftarrow i - 1$

(a)

Algorithm Sibling
  while $P_k \le \|L_k\| - 1$, $\forall k$ do
    if $(t_j - t_i) < -(k - 1)\delta$ then
      $P_j \leftarrow P_j + 1$
    else if $(t_j - t_i) > (k - 1)\delta$, $\forall i < j$ then
      $P_i \leftarrow P_i + 1$
      $j \leftarrow i + 1$
    else
      if $j = n$ then
        $(P_1, \ldots, P_n)$ is the next match
        $P_k \leftarrow P_k + 1$, $\forall k$
        $i \leftarrow 0$; $j \leftarrow 1$
      else
        $j \leftarrow j + 1$

(b)

Figure 4: Maximum matching algorithm for chains and ordered siblings (a); Maximum matching
algorithm for unordered siblings (b). In the algorithms above, we initialize $i = 0$; $j = 1$ ($i, j$ are
time list positions), and $P_1, \ldots, P_n = 0$ ($P_k$ is an index within $L_k$). Let $t_i = L_i[P_i]$ and $t_j = L_j[P_j]$.
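The greedy procedure described in the prose can be sketched as follows. This is an illustrative version that rescans the consecutive pairs from the start after every pointer advance, so it is simpler but slower than the linear-time bookkeeping of Figure 4; the function name and the sample lists are assumptions.

```python
def greedy_chain_matches(lists, tau_min, tau_max):
    """Greedily collect disjoint chain occurrences from sorted time lists.

    lists[i] holds the sorted times of the i-th edge of the chain. One
    pointer per list is kept; the first violated consecutive pair decides
    which pointer advances.
    """
    pos = [0] * len(lists)
    matches = []
    while all(p < len(times) for p, times in zip(pos, lists)):
        for i in range(len(lists) - 1):
            gap = lists[i + 1][pos[i + 1]] - lists[i][pos[i]]
            if gap > tau_max:      # earlier time is too small: advance it
                pos[i] += 1
                break
            if gap < tau_min:      # later time is too small: advance it
                pos[i + 1] += 1
                break
        else:
            # No pair violated the constraint: record the match and move on.
            matches.append(tuple(times[p] for times, p in zip(lists, pos)))
            pos = [p + 1 for p in pos]
    return matches
```

With two lists this reduces to the triple case: the call `greedy_chain_matches([[6, 12], [20, 25]], 5, 15)` pairs 6 with 20 and 12 with 25.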

The next theorem gives the correctness of the algorithms. 

Theorem 1. Algorithm-Chain and Algorithm-Sibling find maximum matchings.

Proof. By induction. Given a set of time lists $L = (L_1, L_2, \ldots, L_n)$ our algorithm produces a
matching $M = (m_1, m_2, \ldots, m_k)$, where each matching $m_i$ is a sequence of $n$ times, one from each of the
$n$ time lists, $m_i = (t^i_1, t^i_2, \ldots, t^i_n)$. Let $M^* = (m^*_1, m^*_2, \ldots, m^*_{k^*})$ be a maximum matching of size $k^*$.
We prove that $k = k^*$ by induction on $k^*$. The next lemma follows directly from the construction
of the Algorithms.

Lemma 1. If there is a valid matching our algorithm will find one. 

Lemma 2. Algorithm-Chain and Algorithm-Sibling find an earliest valid matching. Let the first
valid matching found by either algorithm be $m_1 = (t_1, t_2, \ldots, t_n)$; then for any other valid matching
$m' = (s_1, s_2, \ldots, s_n)$, $t_i \le s_i$ $\forall\, i = 1, \ldots, n$.


Proof. Proof by contradiction. Assume that in $m_1$ and $m'$ there exists a corresponding pair of times
$s < t$ and let $s_i, t_i$ be the first such pair. Since $m_1$ and $m'$ are valid matchings, $s_i$ and $t_i$ obey
the constraints $\tau_{min} \le (t_{i+1} - t_i) \le \tau_{max}$, $\tau_{min} \le (t_i - t_{i-1}) \le \tau_{max}$ and $\tau_{min} \le (s_{i+1} - s_i) \le \tau_{max}$,
$\tau_{min} \le (s_i - s_{i-1}) \le \tau_{max}$.

Since $s_i < t_i$, then $\tau_{min} \le (t_{i+1} - s_i)$ and $\tau_{max} \ge (s_i - t_{i-1})$. Also because
$s_{i-1} \ge t_{i-1}$, we get that $\tau_{min} \le (s_i - t_{i-1})$ and since $(s_{i+1} - s_i) \le \tau_{max}$, then
$(\min(t_{i+1}, s_{i+1}) - s_i) \le \tau_{max}$ as well. But if $s_i$ satisfies the above conditions, then $m_1$
would not be the first valid matching, because the first matching $m_f$ would be
$m_f = (t_1, t_2, \ldots, t_{i-1}, s_i, \min(t_{i+1}, s_{i+1}), \min(t_{i+2}, s_{i+2}), \ldots, \min(t_n, s_n))$.

Let us show this by induction on the number of pairs $p$ of the type $\min(t_{i+j}, s_{i+j})$, where $s_i < t_i$
and $j \ge 1$.

If $p = 1$, then $j = 1$, and since $\tau_{min} \le (s_{i+1} - s_i) \le \tau_{max}$ and $\tau_{min} \le (t_{i+1} - s_i)$, then
$\tau_{min} \le (\min(t_{i+1}, s_{i+1}) - s_i) \le \tau_{max}$ as well, and therefore satisfies the matching constraints.

Let the matching constraints be satisfied up to $p = m$, such that in the matching
$m_f = (t_1, t_2, \ldots, t_{i-1}, s_i, \min(t_{i+1}, s_{i+1}), \ldots, \min(t_{i+m}, s_{i+m}), \ldots, \min(t_n, s_n))$ the sequence of
elements of $m_f$ up to $\min(t_{i+m}, s_{i+m})$ satisfy the matching constraints. Then we can show that
$\min(t_{i+m+1}, s_{i+m+1})$ is also a part of the matching. Since $m_1$ and $m'$ are both valid matchings,
then $\tau_{min} \le (t_{i+m+1} - t_{i+m}) \le \tau_{max}$ and $\tau_{min} \le (s_{i+m+1} - s_{i+m}) \le \tau_{max}$, from which we get that
$\tau_{min} \le (\min(t_{i+m+1}, s_{i+m+1}) - \min(t_{i+m}, s_{i+m})) \le \tau_{max}$. Therefore, $\min(t_{i+m+1}, s_{i+m+1})$ is also
a part of the matching.

Thus, we get a contradiction since $m_f$ would be an earlier matching if there exists a pair of times
$s_i < t_i$. Therefore, Algorithm-Chain and Algorithm-Sibling find an earliest valid matching. □

If $k^* = 0$, then $k = 0$ as well. If $k^* = 1$, then there exists a valid matching and by Lemma 1 our
algorithm will find it.

Suppose that for all sets of time lists for which $k^* = M$, the algorithm finds matchings of size
$k^*$. Now consider a set of time lists $L = (L_1, L_2, \ldots, L_n)$ for which an optimal algorithm produces a
maximum matching of size $k^* = M + 1$ and consider the first matching in this list (remember that
by the causality constraint, the matchings can be ordered). Our algorithm constructs the earliest
matching and then recursively processes the remaining lists. By Lemma 2 our first matching is not
later than optimal’s first matching, so the partial lists remaining after our first matching contain
the partial lists after optimal’s first matching. This means that the optimal matching for our partial
lists must be $M$. By the induction hypothesis our algorithm finds a matching of size $M$ on these
partial lists for a total matching of size $M + 1$. □

For a given set of time lists $L = (L_1, L_2, \ldots, L_n)$ as input, where each $L_i$ has a respective size
$d_i$, define the total size of the data as $\|D\| = \sum_{i=1}^{n} d_i$.

Theorem 2. Algorithm-Chain runs in $O(\|D\|)$ time.

Theorem 3. Algorithm-Sibling runs in $O(n \cdot \|D\|)$ time.

6.2 Finding all Triples 

Assume the data are stored in a vector. Each component in the vector corresponds to a sender id
and stores a balanced search tree of receiver lists (indexed by a receiver id). Let $S$ be the whole
set of distinct senders. The algorithm for finding chain triples considers a sender id $s$ and its list of
receivers $\{r_1, r_2, \ldots\}$. Then for each such receiver $r_i$ that is also a sender, let $\{p_1, p_2, \ldots\}$ be
the receivers to which $r_i$ sent messages. All chains beginning with $s$ are of the form $s \to r_i \to p_j$.
This way we can more efficiently enumerate the triples (since we ignore triples which do not occur).
For each sender $s$ we count the frequency of each triple $s \to r_i \to p_j$.
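The enumeration just described can be sketched with nested dictionaries standing in for the vector of balanced search trees (a deliberate simplification). The stream is the one transcribed from Figure 1(b); the function names are hypothetical, and skipping chains that return to the original sender is an assumption of this sketch.

```python
from collections import defaultdict

# (time, sender, receiver) records transcribed from Figure 1(b).
STREAM = [
    (0, "A", "C"), (5, "C", "F"), (6, "A", "B"), (12, "A", "B"),
    (13, "F", "G"), (13, "F", "H"), (15, "A", "C"), (20, "B", "D"),
    (20, "B", "E"), (22, "C", "F"), (25, "B", "D"), (25, "B", "E"),
    (31, "F", "G"), (31, "F", "H"),
]


def index_by_sender(stream):
    """sender -> receiver -> sorted list of message times."""
    index = defaultdict(lambda: defaultdict(list))
    for time, sender, receiver in sorted(stream):
        index[sender][receiver].append(time)
    return index


def candidate_chain_triples(index):
    """Yield every chain s -> r -> p whose two edges both occur in the data."""
    for s, receivers in index.items():
        for r in receivers:
            if r not in index:      # r never sends, so no chain continues
                continue
            for p in index[r]:
                if p != s:          # assumption: a chain does not return to s
                    yield (s, r, p)
```

Only triples whose edges actually occur are generated, which is the saving the text refers to; each candidate can then be handed to the matching routine with the time lists `index[s][r]` and `index[r][p]`.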

Theorem 4. The algorithm to find all triple frequencies takes $O(\|D\| + n \cdot \|D\|)$ time.

6.3 General Scoring Functions for 2D-Matching

One can observe that for our 2D-matching we are using a so-called “Step Function”, which returns
1 for values in $[\tau_{min}, \tau_{max}]$, and gives 0 otherwise. Such a function represents the probability
delay density, which is the distribution of the time it takes to propagate a message once it is received.


[Two panels, each plotting the probability of reaction against the time since the message was received; in the left panel the time axis is marked at min and max.]

Figure 5: Step function on the left and a general response function for 2D matching on the right.

Here we extend our matching algorithm to be able to use any general propagation delay density
function, see Figure 5.

Usage of these various functions may uncover some additional information about the streaming 
groups and their structure which the “Step Function” missed. 

Unfortunately, the matching problem with an arbitrary function, unlike in the case with the 
“Step Function” which can be solved in linear time, cannot be solved so efficiently. 

First we provide an efficient algorithm to find a 2D maximum matching which satisfies a causality
constraint (a maximum weight matching which has no intersecting edges). Additionally we will
provide an approach involving the Hungarian algorithm to discover a maximum weighted
2D-matching, which does not obey the causality constraint (edges involved in the maximum matching
may intersect).

Given the two time lists $L_1 = \{t_1, t_2, \ldots, t_n\}$ and $L_2 = \{s_1, s_2, \ldots, s_m\}$ and a general scoring
function $f(\cdot)$ over the specified time interval $[\tau_{min}, \tau_{max}]$, we would like to find a maximum weighted
2D matching between these two time lists, such that the matching has no intersecting edges. Having no
intersecting edges intuitively guarantees the causality constraint. To solve this problem we
employ dynamic programming. Let $M_{i,j}$ be a maximum matching with the respective
weight $w(M_{i,j})$, obeying the causality constraint, involving up to and including the $i$'th item of the
list $L_1$ and up to and including the $j$'th item of the list $L_2$. Thus, the matching $M_{n,m}$ will hold the
maximum weighted matching for the entire lists $L_1$ and $L_2$. When we compute the matching, we
attempt to improve it from step to step by adding only the edges (matches) which do not intersect
any of the edges already present in the matching. The following description of the algorithm will
show why this is the case.

We will illustrate now that if we have correct solutions to subproblems Mjj_i and 

Mi-ij-i, then we can construct a maximum matching Mij, which obeys the causality constraint 
by considering the following two simple cases: 
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1 : Algorithm Match-Causality 

2 : Compute and {Mpi, M 2 ,i,..., Mm,i} 

3: for i = 2; i < n; i + + do 
4; for j = 2; j < m; j + + do 

5: Mjj = max{w{Mi_ij_i U {U, Sj)),w{Mi_ij),w{Mij_i)} 

6 : Store a direction for backtracking 

7: Start at Mm,n and backtrack to retrieve the edges of the matching 

Figure 6: Algorithm to discover a maximum weighted matching which obeys the causality con¬ 
straint. In the algorithm above, we initialize i = 0;j = 0 (i,j are time positions in lists 
L\ — ■ ■ ■ itfi} and L 2 — {si; ^ 2 j ■ ■ ■; 

1. Either the elements ti and sj are both matched to each other in the matching Mjj, in which 

case Mjj = U {ti,Sj). Obviously the edge does not intersect any of the 

previous edges of so we maintain the causality constraint; 

2. Or, the elements ti and sj are not matched to each other in the matching Mij. Then, one of 
the ti or Sj is not matched (see Lemma 0]), which means that Mij = 

No edges are added to the matching in this case. 

We initialize our algorithm by computing in linear time the base set of matchings
$\{M_{1,1}, M_{1,2}, \ldots, M_{1,n}\}$ (the bottom row) and $\{M_{2,1}, \ldots, M_{m,1}\}$ (the
leftmost column) of the two-dimensional array of subproblems (of size $n \cdot m$) that is being
built up. The matchings $\{M_{1,1}, M_{1,2}, \ldots, M_{1,n}\}$ are constructed by taking the first
element $s_1$ from the list $L_2$ and computing all of the weights of the edges $w(t_i, s_1)$, s.t.
$w(M_{1,1}) = f(s_1 - t_1)$ (contains the edge $(t_1, s_1)$, if its weight is not 0),
$w(M_{1,2}) = \max\{f(s_1 - t_1), f(s_1 - t_2)\}$ (contains the heavier of the two edges $(t_1, s_1)$,
$(t_2, s_1)$), up to $w(M_{1,n}) = \max\{f(s_1 - t_1), f(s_1 - t_2), \ldots, f(s_1 - t_n)\}$ (contains
the edge of maximum weight considered over all $t_i$). We similarly compute the set of matchings
$\{M_{2,1}, \ldots, M_{m,1}\}$. Next we are ready to fill in the rest of the two-dimensional array of
subproblems starting with $M_{2,2}$, since $M_{1,1}$, $M_{1,2}$ and $M_{2,1}$ are all available. The
pseudo code of the algorithm is given in Figure 6.
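
The recurrence can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `match_causality` is hypothetical, the table is padded with a zero row and column instead of computing the base row and column separately, and the score function `f` is assumed to be non-negative.

```python
def match_causality(t_list, s_list, f):
    """Maximum-weight matching without intersecting edges (sketch).

    t_list, s_list: time lists sorted in increasing order.
    f: score of a delay s - t, assumed non-negative.
    """
    n, m = len(t_list), len(s_list)
    # best[i][j] = weight of the best matching on t_1..t_i and s_1..s_j;
    # row 0 and column 0 stand for the empty prefix (weight 0).
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = best[i - 1][j - 1] + f(s_list[j - 1] - t_list[i - 1])
            best[i][j] = max(pair, best[i - 1][j], best[i][j - 1])

    # Backtrack from the last cell to recover the edges.
    edges = []
    i, j = n, m
    while i > 0 and j > 0:
        w = f(s_list[j - 1] - t_list[i - 1])
        if w > 0 and best[i][j] == best[i - 1][j - 1] + w:
            edges.append((t_list[i - 1], s_list[j - 1]))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    edges.reverse()
    return best[n][m], edges
```

Because an edge $(t_i, s_j)$ is only ever appended to a solution on strictly earlier prefixes, the recovered edges cannot intersect, and the two nested loops make the $n \cdot m$ cost visible.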

Lemma 3. The matching constructed by algorithm Match-Causality, obeys the causality constraint 
(contains no intersecting edges). 

Proof. By construction of our algorithm, during the computation of every $M_{i,j}$ a new edge is
added to the matching only if $M_{i-1,j-1} \cup (t_i, s_j)$ is picked as the maximum. But since $t_i$
and $s_j$ are the very last two elements for the matching $M_{i,j}$, the new edge cannot intersect any
of the existing edges. Thus, since at each step our algorithm consistently adds edges which do not
intersect any of the previously added edges, the final matching will contain no intersecting edges. □

Lemma 4. If the items $t_i$ and $s_j$ are not matched to each other in the matching $M_{i,j}$, then
one of $t_i$, $s_j$ is not matched at all.

Proof. Let us assume for the sake of contradiction that both $t_i$ and $s_j$ are matched with some
nodes. This automatically implies that $t_i$ must be matched with some $s_{j'}$, which appears before
$s_j$ in the list $L_2$, and $s_j$ is matched with some $t_{i'}$, which occurs before $t_i$ in the
list $L_1$. But this means that the edges $(t_i, s_{j'})$ and $(t_{i'}, s_j)$ intersect, a
contradiction. □

Theorem 5. Algorithm Match-Causality correctly finds a maximum weighted matching. 





Proof. Proof by induction. For the base case let us consider the case where $\|L_1\| = 1$ and
$\|L_2\| = 1$; in this case the algorithm will trivially match $t_1$ (the only element of $L_1$) with
$s_1$ (the only element of $L_2$) as long as $f(s_1 - t_1) > 0$, otherwise the matching would be
empty.

For the inductive step we assume that our algorithm correctly finds all of the maximum weighted
matchings which obey the causality constraint up to and including $M_{i,j-1}$, and show that it then
correctly finds the maximum matching which obeys the causality constraint for $M_{i,j}$ (the very
next position it considers after $M_{i,j-1}$). By our assumption we know that our algorithm correctly
found the matchings $M_{i-1,j-1}$, $M_{i-1,j}$ and $M_{i,j-1}$, which all obey the causality
constraint, since all of them were computed before $M_{i,j}$. Our algorithm then by construction
picks the maximum weight matching from the set of 3 possible matchings
$\{M_{i-1,j-1} \cup (t_i, s_j),\ M_{i-1,j},\ M_{i,j-1}\}$, which by the two cases above are the only
candidates. This guarantees that $M_{i,j}$ is of maximum weight and obeys the causality
constraint. □

Theorem 6. Algorithm Match-Causality runs in $O(n \cdot m)$ time.

The general propagation delay function $f(\cdot)$ can have any shape, and one can wonder if it is
possible to find an algorithm which will perform faster than $O(n \cdot m)$ for some special case of
the general propagation delay function. Let us consider one of the most intuitive scenarios, where
the propagation delay function is monotonically decreasing. We prove that there does not exist an
algorithm which can construct the maximum weight matching obeying the causality constraint in less
than $O(n \cdot m)$ time.

Theorem 7. An algorithm which finds exactly the maximum weight matching obeying the causality
constraint, for a propagation delay function which is strictly monotonically decreasing (not a
“step” function), requires at least $O(n \cdot m)$ time.

Proof. Consider the two time lists $L_1 = \{t_1, t_2, \ldots, t_n\}$, $L_2 = \{s_1, s_2, \ldots, s_m\}$,
where every time $s_j > t_n$, and a strictly monotonically decreasing function $f(\cdot)$, s.t.
$f(s_m - t_1) > 0$. The first observation to make is that $t_n$ must be a part of the matching. If
$t_{n'}$ is the last matched item and $t_n$ is not matched, where $n' < n$, then the matching can be
improved by replacing $t_{n'}$ with $t_n$, since $f(\cdot)$ is a strictly monotonically decreasing
function and $n' < n$.

If the matching obeys the causality constraint, then the maximum weight matching can be
$\{f(s_1 - t_n)\}$ or $\{f(s_2 - t_n) + f(s_1 - t_{n-1})\}$ or $\ldots$ or
$\{f(s_1 - t_1) + f(s_2 - t_2) + \cdots + f(s_m - t_n)\}$, on the order of $O(n \cdot m)$
combinations. And since the function is an arbitrary strictly monotonically decreasing function, one
cannot guarantee the optimality of the discovered matching without considering all of the mentioned
$O(n \cdot m)$ combinations. Thus an algorithm which finds exactly the maximum weight matching obeying
the causality constraint, for a propagation delay function which is strictly monotonically decreasing
(not a “step” function), requires at least $O(n \cdot m)$ time. □

Additionally we present a method to discover a maximum weight matching for a general propagation
delay function, which does not have to obey the causality constraint (we allow the intersection of
edges in the matching). The general idea is to use the Hungarian algorithm to find a maximum
weighted 2D-matching for a pair of time lists.

First, given two time lists $L_1 = \{t_1, t_2, \ldots, t_n\}$ and $L_2 = \{s_1, s_2, \ldots, s_m\}$ and
a general scoring function $f(\cdot)$ over the specified time interval $[T_{min}, T_{max}]$, we
construct the bipartite graph, where on the left we have the set of $n$ nodes, each node representing
a respective time from $\{t_1, t_2, \ldots, t_n\}$, and on the right we have a set of $m$ nodes
representing each of the times $\{s_1, s_2, \ldots, s_m\}$ respectively. Each pair of nodes $t_i$ and
$s_j$ is connected by an edge, where the weight on the edge equals $f(s_j - t_i)$ (0 if outside the
$[T_{min}, T_{max}]$ bounds).

Once we have constructed the bipartite graph we are ready to run the Hungarian algorithm. The
produced matching $M$ is of maximum weight, but does not take into account the causality constraint
(some of the edges of $M$ may intersect). This algorithm runs in cubic time.
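
As an illustration of the graph-construction step only, the edge weights can be laid out as an $n \times m$ matrix. The helper name `build_weight_matrix` is hypothetical and the sketch assumes the delay window is inclusive at both ends; the matching step itself is not shown.

```python
def build_weight_matrix(t_list, s_list, f, t_min, t_max):
    """Row i, column j holds the weight of edge (t_i, s_j).

    The weight is f(s_j - t_i) when the delay lies in [t_min, t_max]
    (assumed inclusive) and 0 otherwise.
    """
    return [
        [f(s - t) if t_min <= s - t <= t_max else 0.0 for s in s_list]
        for t in t_list
    ]
```

A matrix of this shape is the usual input to an assignment solver; for instance, SciPy's `scipy.optimize.linear_sum_assignment` called with `maximize=True` could be applied to it, although the text does not say which implementation was used.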

We use the Enron data to test general propagation delay functions against the “step” function. The
results of our experiments are presented in Section 13. It turns out that in most cases the more
general propagation delay function adds little value in practice. Thus, the more efficient step
function seems adequate.

7 Statistically Significant Triples 

We determine the minimum frequency $\kappa$ that makes a triple statistically significant, using a
statistical model that mimics certain features of the data: we model the inter-arrival time
distribution and the receiver id probability conditioned on sender id, to generate synthetic data and
find all randomly occurring triples to determine the threshold frequency $\kappa$.

7.1 A Model for the Data 

We estimate directly from the data the message inter-arrival time distribution $f(\tau)$, the
conditional probability distribution $P(r|s)$, and the marginal distribution $P(s)$ using simple
histograms (one for $f(\tau)$, $S$ for $P(r|s)$ and $S$ for $P(s)$, i.e. one conditional and marginal
distribution histogram for each sender, where $S$ is the number of senders). One may also model
additional features (e.g. $P(s|r)$), to obtain more accurate models. One should however bear in mind
that the more accurate the model, the closer the random data is to the actual data, hence the less
useful the statistical analysis will be - it will simply reproduce the data.

7.2 Synthetic Data 

Suppose one wishes to generate $N$ messages using $f(\tau)$, $P(r|s)$ and $P(s)$. First we generate
$N$ inter-arrival times independently, which specifies the times of the communications. We now must
assign sender-receiver pairs to each communication. The senders are selected independently from
$P(s)$. We then generate each receiver independently, but conditioned on the sender of that
communication, according to $P(r|s)$.
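
The generation procedure can be sketched as below. The function names `fit_model` and `generate` are hypothetical, and the sketch makes two simplifying assumptions that are not in the text: inter-arrival times are resampled from the observed gaps instead of a binned histogram, and the input contains at least two messages.

```python
import random
from collections import Counter, defaultdict


def fit_model(messages):
    """messages: list of (sender, receiver, time), sorted by time."""
    # Empirical inter-arrival times between consecutive messages.
    gaps = [b[2] - a[2] for a, b in zip(messages, messages[1:])]
    # Marginal sender counts, proportional to P(s).
    p_s = Counter(s for s, _, _ in messages)
    # Per-sender receiver counts, proportional to P(r|s).
    p_r_given_s = defaultdict(Counter)
    for s, r, _ in messages:
        p_r_given_s[s][r] += 1
    return gaps, p_s, p_r_given_s


def generate(n, gaps, p_s, p_r_given_s, rng):
    """Draw n synthetic (sender, receiver, time) messages."""
    senders = list(p_s)
    sender_weights = [p_s[s] for s in senders]
    out, now = [], 0
    for _ in range(n):
        now += rng.choice(gaps)  # independent inter-arrival time
        s = rng.choices(senders, weights=sender_weights)[0]
        receivers = p_r_given_s[s]
        r = rng.choices(list(receivers), weights=list(receivers.values()))[0]
        out.append((s, r, now))
    return out
```

Passing an explicit `random.Random(seed)` as `rng` keeps each synthetic data set reproducible.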

7.3 Determining the Significance Threshold 

To determine the significance threshold $\kappa$, we generate $M$ (as large as possible) synthetic
data sets and determine the triples together with their frequencies of occurrence in each synthetic
data set. The threshold $\kappa$ may be selected as the average plus two standard deviations, or (more
conservatively) as the maximum frequency of occurrence of a triple.
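
Both choices of threshold can be computed in a few lines. The sketch below is illustrative: the name `significance_threshold` is hypothetical, and pooling the frequencies of all synthetic data sets and using the population standard deviation are assumptions, since the text does not fix these details.

```python
import statistics


def significance_threshold(freqs_per_dataset, conservative=False):
    """freqs_per_dataset: one list of random-triple frequencies per synthetic set."""
    pooled = [f for freqs in freqs_per_dataset for f in freqs]
    if conservative:
        # Maximum frequency of any randomly occurring triple.
        return max(pooled)
    # Average plus two standard deviations.
    return statistics.mean(pooled) + 2 * statistics.pstdev(pooled)
```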

8 Constructing Larger Graphs using Heuristics 

Now we discuss a heuristic method for building larger communication structures, using only
statistically significant triples. We will start by introducing the notion of an overlap factor. We
will then discuss how the overlap factor is used to build a larger communication graph by finding
clusters, and construct the larger communication structures from these clusters.

8.1 Overlap between Triples

For two statistically significant triples $(A, B, C)$ and $(D, E, F)$ (chain or sibling) with maximum
matchings at the times $M_1 = \{(t_1, s_1), \ldots, (t_k, s_k)\}$ and
$M_2 = \{(t'_1, s'_1), \ldots, (t'_p, s'_p)\}$, we use an overlap weighting function $W(M_1, M_2)$ to
capture the degree of coincidence between the matchings $M_1$ and $M_2$. The simplest such overlap
weighting function is the extent to which the two time intervals of communication overlap.
Specifically, $W(M_1, M_2)$ is the percent overlap between the two intervals $[t_1, s_k]$ and
$[t'_1, s'_p]$:

$$W(M_1, M_2) = \max\left\{ \frac{\min(s_k, s'_p) - \max(t_1, t'_1)}{\max(s_k, s'_p) - \min(t_1, t'_1)},\ 0 \right\}$$

A large overlap factor suggests that both triples are part of the same hidden group. More
sophisticated overlap factors could take into account intermittent communication, but for our
present purpose we will use this simplest version.
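
A small sketch of the interval-overlap factor follows. The name `overlap_weight` is hypothetical; clipping the value at zero for disjoint intervals and returning zero for a degenerate (zero-length) span are assumptions made so that the result is always a valid non-negative weight.

```python
def overlap_weight(m1, m2):
    """Percent overlap of the time intervals spanned by two matchings.

    m1, m2: non-empty lists of (t, s) pairs ordered in time; the interval of
    a matching runs from its first t to its last s.
    """
    start1, end1 = m1[0][0], m1[-1][1]
    start2, end2 = m2[0][0], m2[-1][1]
    span = max(end1, end2) - min(start1, start2)
    if span <= 0:
        return 0.0  # degenerate case, handled by assumption
    shared = min(end1, end2) - max(start1, start2)
    return max(shared / span, 0.0)
```

For example, intervals $[0, 10]$ and $[5, 15]$ share 5 time units out of a combined span of 15, giving an overlap factor of $1/3$.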


8.2 The Weighted Overlap Graph and Clustering 

We construct a weighted graph by taking all significant triples to be the vertices in the graph. Let
$M_i$ be the maximum matching corresponding to vertex (triple) $v_i$. We define the weight of the
edge $e_{ij}$ to be $\omega(e_{ij}) = W(M_i, M_j)$, producing an undirected complete graph (some
weights may be 0). By thresholding the weights, one could obtain a sparse graph. Dense subgraphs
correspond to triples that were all active at about the same time, and are candidate hidden groups.
We want to cluster the graph into dense, possibly overlapping subgraphs. Given the triples in a
cluster we can build a directed graph, consistent with all the triples, to represent its
communication structure. A cluster containing multiple connected components implies the existence of
some hidden structure connecting them. Below is an outline of the entire algorithm:

1: Obtain the significant triples.

2: Construct a weighted overlap graph (weights are overlap factors between pairs of triples).

3: Perform clustering on the weighted graph.

4: Use each cluster to determine a candidate hidden group structure.

For the clustering, since clusters may overlap, we use the algorithms presented in [ail]. 
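
Step 2 of the outline, with the optional thresholding, can be sketched as follows. The function name `build_overlap_graph`, the dictionary-based edge representation and the `threshold` parameter are illustrative assumptions; the overlap function is passed in by the caller, and the clustering step is not shown.

```python
from itertools import combinations


def build_overlap_graph(matchings, weight, threshold=0.0):
    """Return {(u, v): w} for pairs of triples whose overlap exceeds threshold.

    matchings: dict mapping each significant triple to its maximum matching.
    weight: function of two matchings returning their overlap factor.
    """
    edges = {}
    for (u, mu), (v, mv) in combinations(matchings.items(), 2):
        w = weight(mu, mv)
        if w > threshold:  # thresholding yields a sparse graph
            edges[(u, v)] = w
    return edges
```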


9 Algorithm for Querying Tree Hidden Groups 

We describe efficient algorithms for computing (exactly) the frequency of a hidden group whose
communication structure is an arbitrary pre-specified tree. We assume that messages initiate from
the root. The parameters $T_{min}$, $T_{max}$, $\delta$ are also specified. Such an algorithm can be
used in conjunction with the previous heuristic algorithms to verify that a discovered tree-like
structure actually occurs frequently in the data.

Let $L$ be an adjacency list for the tree $T$, and $D$ a dataset in which we will query this tree.
The first entry in the list $L$ is the root communicator followed by the list of all its children
(receivers) the root sends to. The next entries in $L$ contain the lists of children for each of the
receivers of $L_{root}$, until we reach the leaves, which have no children.

After we have read in $D$, we process $L$ and use it to construct the tree, in which every node
contains: the node id, the time list of when its parent sent messages to it, and a list of children.
We construct such a tree by processing $L$ and checking, for each communicator that has children,
whether it is present in $D$ as a Sender and whether its children are present in $D$ in the list of
its Receivers. During the construction, if a node that has children is not present in $D$ as a
Sender, or some child is not on the list of Receivers of its parent, then we know that the given tree
does not exist in the current data set $D$ and we can stop our search.

    1: Algorithm Tree-Mine($T$, $D$)
    2: $D_{rem} \leftarrow D$
    3: while $M$ = TRUE do
    4: $M$ = FindNext($T$, $D_{rem}$)
    5: if $M$ then
    6: Store Match
    7: Increment List Pointers; get $D_{rem}$

    1: Algorithm FindNext($T$, $D_{rem}$)
    2: Initialize all $truth_j \leftarrow 0$
    3: return FindNextrec(NULL, root)

    1: Algorithm FindNextrec($t$, *node $i$)
    2: (*) Run Algorithm-Sibling from current time list pointers to get $m = (t_1, \ldots, t_n)$
    3: if $m \sim t$ then
    4: for $j = n$ to 1 do
    5: if ($truth_j = 1$ & $prev_j < t_j$) or $truth_j = 0$ then
    6: ($truth_j$, $prev_j$) = FindNextrec($t_j$, *node $j$)
    7: if $truth_j = 0$ then
    8: Increase $t_j$ pointer, GOTO (*)
    9: else
    10: return ($truth_j$, $t$)
    11: if $m < t$ then
    12: Increase $t_j$ pointer, GOTO (*)
    13: if $m > t$ then
    14: return (0, $t$)

Figure 7: Algorithms used for Querying a Tree $T$ in the data $D$. In the algorithms above, $D_{rem}$
represents $D$ in a way that allows the Tree-Mine Algorithm to efficiently access the necessary data.

For a tree to exist there should be at least one matching involving all of the nodes (lists). We
start with the root and consider the time lists of its children. We use Algorithm-Sibling to find the
first matching $m_1 = (t_1, t_2, \ldots, t_n)$, where $t_i$ is an element of the $i$-th child's time
list and $n$ is the number of time lists. After the first matching $m_1$ we proceed by considering
the children in the matching $m_1$ from the rightmost child to the left, by taking the value $t_i$,
which represents this node in the matching, and passing it down to the child node. Next we try to
find a matching $m_2 = (s_1, s_2, \ldots, s_k)$ for the $k$ child time lists. There are three cases to
consider:

1. Every element $s_j$ of the matching $m_2$ also satisfies the chain-constraint with the element
$t_i$: $T_{min} \le s_j - t_i \le T_{max}$, $\forall s_j \in m_2$, $j = k, \ldots, 1$. In this case we
say $m_2 \sim t_i$ ($m_2$ matches $t_i$) and proceed by considering all children. Otherwise consider
the rightmost $s_j \in m_2$ that violates the constraint. The two cases below refer to this $s_j$.

2. If $s_j < t_i + T_{min}$, in which case we say $m_2 < t_i$, we advance to the next element in
child $j$'s time list and continue as with Algorithm-Sibling to find the next matching
$(s'_1, s'_2, \ldots, s'_k)$. This process is repeated as long as $m_2 < t_i$. Eventually we will find
an $m_2$ with $m_2 \sim t_i$, or we will reach the end of some list (in which case there is no further
matching), or we come to a matching $m_2 > t_i$ (see the case below).

3. If $s_j > t_i + T_{max}$, in which case we say $m_2 > t_i$, we advance $t_i$ to the next element in
$i$'s time list on the previous level and proceed as with Algorithm-Sibling to find the next matching
in the previous level. After this new matching $(t'_1, t'_2, \ldots, t'_n)$ is found, the chain
constraints have to be checked for these time lists $(t'_1, t'_2, \ldots, t'_n)$ with their previous
level, and the algorithm proceeds recursively from then on.
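
The three-way comparison between a child-level matching and the parent time can be sketched as a small helper. The name `compare_to_parent` and the string return values are hypothetical, and the sketch assumes the chain constraint is inclusive at both ends, so that the three cases cover every possibility.

```python
def compare_to_parent(child_times, t_parent, t_min, t_max):
    """Classify a child-level matching against the parent's time.

    Returns "~" if every child time s satisfies
    t_min <= s - t_parent <= t_max (case 1), "<" if the rightmost violating
    time is too early (case 2), and ">" if it is too late (case 3).
    """
    for s in reversed(child_times):  # scan from the rightmost element
        if s < t_parent + t_min:
            return "<"
        if s > t_parent + t_max:
            return ">"
    return "~"
```

In the search itself, "<" leads to advancing a pointer at the child level, while ">" leads to advancing the parent's pointer and re-checking the level above.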



Figure 8: Example of a communication tree structure

The entire algorithm for finding a complete matching can be formulated in two steps: find the first
matching; recursively process the remaining parts of the time lists. What we have described is the
first step, which is accomplished by calling the recursive algorithm FindNextrec(NULL, root) that is
summarized in Figure 7. If this returns TRUE, the algorithm has found the first occurrence of the
tree, which can be read off from the current time list pointers. After this instance is found, we
store it and proceed by considering the remaining part of the time lists starting from the root.

To illustrate how Algorithm Tree-Mine works, consider the example tree $T$ in Figure 8. Let node $A$
be the root and let $L_1, \ldots, L_8$ be the time lists. Refer to $(L_1, L_2, L_3)$ as the $phase_1$
lists, $(L_4, L_5, L_6)$ as the $phase_2$ lists and $(L_7, L_8)$ as the $phase_3$ lists. Let
$m_1 = (t_1, t_2, t_3)$, $m_2 = (s_1, s_2, s_3)$ and $m_3 = (r_1, r_2)$ be the first matchings of the
$phase_1$, $phase_2$ and $phase_3$ lists respectively. If $m_2 \sim t_3$ and $m_3 \sim t_1$, we have
found the first matching and we now recursively process the remaining time lists. If $m_2 < t_3$
(e.g. $s_2 < t_3 + T_{min}$), then we move to the next matching in the $phase_2$ lists. If
$m_2 > t_3$, then we move to the next matching in the $phase_1$ lists and reconsider the $phase_2$
matching and the $phase_3$ matching if necessary. If $m_2 \sim t_3$ we then similarly check $m_3$
with $t_1$. Since node $C$ is a leaf, it need not be further processed.

Theorem 8. Algorithm Tree-Mine correctly finds the maximum number of occurrences for a specified
tree $T$.

Proof. Proof by contradiction. Given a set of time lists $L = (L_1, L_2, \ldots, L_n)$ that specify a
tree, our algorithm produces a matching $M = (m_1, m_2, \ldots, m_k)$, where each matching $m_i$ is a
sequence of $n$ times, one from each of the $n$ time lists, $m_i = (t^i_1, t^i_2, \ldots, t^i_n)$. Let
$M^* = (m^*_1, m^*_2, \ldots, m^*_{k^*})$ be a maximum matching of size $k^*$. The next lemma follows
directly from the construction of the algorithms.

Lemma 5. If there is a valid matching our algorithm will find one.

Lemma 6. Algorithm Tree-Mine finds an earliest valid matching (occurrence). Let the first valid
matching found by our algorithm be $m_1 = (t_1, t_2, \ldots, t_n)$; then for any other valid matching
$m' = (s_1, s_2, \ldots, s_n)$, $t_i \le s_i$ $\forall i = 1, \ldots, n$.

Proof. Proof by contradiction. Assume that in $m_1$ and $m'$ there exists a corresponding pair of
times $s < t$ and let $s_i$, $t_i$ be the first such pair. Since $m_1$ and $m'$ are valid matchings,
$s_i$ and $t_i$ obey the chain constraints: $T_{min} \le (t_i - t_p) \le T_{max}$ (where $t_p$ is a
time passed down by the parent node), $T_{min} \le (t_{children} - t_i) \le T_{max}$ (where
$t_{children}$ are times of children of $t_i$), and similarly $T_{min} \le (s_i - s_p) \le T_{max}$,
$T_{min} \le (s_{children} - s_i) \le T_{max}$; and obey the sibling constraint:
$T_{min} \le (t_{i+1} - t_i) \le T_{max}$, $T_{min} \le (t_i - t_{i-1}) \le T_{max}$ (where $t_{i-1}$
and $t_{i+1}$ are matched times of neighboring siblings of $t_i$), and similarly
$T_{min} \le (s_{i+1} - s_i) \le T_{max}$, $T_{min} \le (s_i - s_{i-1}) \le T_{max}$.

Since $s_i < t_i$, then $T_{min} \le (t_{i+1} - s_i)$ and $T_{max} \ge (s_i - t_{i-1})$. Also because
$s_{i-1} \ge t_{i-1}$, we get that $T_{min} \le (s_i - t_{i-1})$, and since
$(s_{i+1} - s_i) \le T_{max}$, then $(\min(t_{i+1}, s_{i+1}) - s_i) \le T_{max}$ as well. By similar
reasoning, since $s_i < t_i$, then $T_{min} \le (t_{children} - s_i)$ and $T_{max} \ge (s_i - t_p)$;
also since $s_p \ge t_p$, we get that $T_{min} \le (s_i - t_p)$, and since
$(s_{children} - s_i) \le T_{max}$, then $(\min(t_{children}, s_{children}) - s_i) \le T_{max}$ as
well. But if $s_i$ satisfies all of the chain and sibling constraints, then $m_1$ would not be the
first valid matching, as has already been proven for the chain and sibling triple algorithms, because
the first matching $m_f$ would be
$m_f = (t_1, t_2, \ldots, t_{i-1}, s_i, \min(t_{i+1}, s_{i+1}), \min(t_{i+2}, s_{i+2}), \ldots, \min(t_n, s_n))$.
Thus, Algorithm Tree-Mine finds the earliest possible matching (occurrence). □

Now let us assume, for the purpose of contradiction, that Tree-Mine does not find a maximum number
of occurrences of a specified tree $T$, i.e. $k < k^*$.

The situation where $k < k^*$ can only appear if $M^*$ discovers an occurrence of $T$ before
Tree-Mine does, i.e. some occurrence $m^*_i$ is earlier than its respective occurrence $m_i$. But
such a situation cannot happen, since, given the set of time lists (or the remainder of them, if we
have already processed some of them) which define the tree $T$, Tree-Mine is guaranteed to find the
earliest valid match by Lemma 6. Thus, we obtain a contradiction. This proves that Tree-Mine
correctly finds the maximum number of occurrences of a specified tree $T$. □

Theorem 9. Algorithm Tree-Mine runs in $O(d_{max} \cdot \|D\|)$.

9.1 Mining all Frequent Trees

Here we propose an algorithm which allows us to discover the frequency of general trees and to
enumerate all statistically significant general trees of a specified size and frequency. The
parameters $T_{min}$, $T_{max}$, $\delta$ and $\kappa$ must be specified. Additionally, one can
specify the minimum and the maximum tree size to bound the size of the trees of interest. The
parameter $\kappa$ in this algorithm represents the minimal frequency threshold, and is used to
discard the trees which occur fewer times than the specified threshold.

As for any tree mining problem, there are two main steps for discovering frequent trees. First, 
we need a systematic way of generating candidate trees whose frequency is to be computed. Second, 
we need efficient ways of counting the number of occurrences of each candidate in the database 
D and determining which candidates pass the threshold. To address the second issue we use our 
Algorithm Tree-Mine to determine the frequency of a particular tree. To systematically generate
new candidate trees we adopt the ideas of Equivalence Class-based Extensions and Rightmost Path
Extensions proposed and described in [38], [37].

The algorithm proceeds in the following order: 

(i) Systematically generate new candidates, by extending only the frequent trees until no more 
candidates can be extended; 

(ii) Use Algorithm Tree-Mine to determine the frequency of our candidates; 

(iii) If the candidate’s frequency is above threshold - store the candidate. 

The main advantage of equivalence class extensions is that only known frequent elements are used
for extensions. But to guarantee that all possible extensions are considered, the non-redundant tree
generation idea has to be relaxed. In this way the canonical class (which considers candidates only
in canonical form) and equivalence class extensions represent a trade-off between the number of
isomorphic candidates generated and the number of potentially frequent candidates to count.

Theorem 10. Mining all of the tress on the current level requires 0{n^ ■dmax'\\D\\'{v + log{dmax))) 
operations. 

10 Comparing Methods 

To compare methods, we need to be able to measure similarity between sets of overlapping clusters.
We will use the Best Match approach proposed in [18], which we briefly describe here.

We formally define the problem as follows:

• Let $C_1 = \{S_1, S_2, \ldots, S_n\}$ and $C_2 = \{S'_1, S'_2, \ldots, S'_m\}$ be the two
clusterings of size $n$ and $m$ respectively, where $S_i$ and $S'_j$ are the groups that form the
clusterings. A group does not contain duplicates.

• Let $D(C_1, C_2)$ be the distance between the clusterings $C_1$ and $C_2$.

• The task is to find $D(C_1, C_2)$ efficiently, while ensuring that $D(C_1, C_2)$ reflects the
actual distance between the network structures that $C_1$ and $C_2$ represent.

The Best Match algorithm determines how well the clusterings represent each other. That is, when
given $C_1 = \{S_1, S_2, \ldots, S_n\}$ and $C_2 = \{S'_1, S'_2, \ldots, S'_m\}$ it will determine how
well $C_2$ represents $C_1$ and vice versa.

We begin by considering every group $S \in C_1$ and finding a group $S' \in C_2$ with the minimum
distance $d(S, S')$ between them. The Best Match algorithm can run with any set difference measure
which measures the distance between two sets $S$, $S'$. We define the distance $d(S, S')$ between the
two groups $S$ and $S'$ as the number of moves (changes) necessary to convert $S$ into $S'$:

$$d(S, S') = |S| + |S'| - 2\,|S \cap S'|$$


Note that alternatively we can also define $d(S, S')$ as:

$$d(S, S') = 1 - \frac{|S \cap S'|}{|S \cup S'|}$$

As we step through $C_1$, we find for each group $S_k \in C_1$ the closest group $S'_j \in C_2$, i.e.
the one with the minimal distance $\min_{j = 1, \ldots, m} d(S_k, S'_j)$.

Next we sum up all such distances. For the purposes of normalization one can normalize the obtained
sum by the total number of distinct members in $C_1$ to obtain $D(C_1, C_2)$:

$$D(C_1, C_2) = \frac{\sum_{k=1}^{n} \min_{j} d(S_k, S'_j)}{T_{C_1}}, \qquad T_{C_1} = \Big\| \bigcup_{k=1}^{n} S_k \Big\|;$$

this normalization computes a distance per node. One can also normalize by $\|C_1\|$ and $\|C_2\|$.

So far we have found the distance measure $D(C_1, C_2)$, which reflects how well the groups in $C_1$
are represented in $C_2$. If this asymmetric measure of the distance is considered adequate, one may
stop the algorithm here. However, since in most cases we want the measure to be symmetric with
respect to both clusterings, we also want to know how well $C_1$ represents $C_2$. We will thus
repeat the same calculation for each group in $C_2$ with respect to the groups in $C_1$ and normalize
the sum of distances using one of the normalization methods. Finally, the Best Match symmetric
distance between a pair of clusterings $C_1$ and $C_2$ is defined as:

$$D_{BestMatch}(C_1, C_2) = \frac{D(C_1, C_2) + D(C_2, C_1)}{2}$$


This result can be viewed as a representation of the average number of moves per distinct member
(or set) necessary to represent one clustering by the other.

Intuitively the Best Match algorithm is a relative measure of distance, and reflects how well two
clusterings represent each other, and how similar or different the social networks formed by these
clusterings are. This approach is not sensitive to having clusterings of different size or having
overlapping sets. Refer to [18] for more details.
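
A compact sketch of the Best Match computation is given below, assuming groups are Python sets, the move-count distance $|S| + |S'| - 2|S \cap S'|$ is used, and each directed distance is normalized by the number of distinct members of the first clustering. The function names are hypothetical.

```python
def set_distance(a, b):
    """Number of moves needed to turn set a into set b."""
    return len(a) + len(b) - 2 * len(a & b)


def directed_distance(c1, c2, d=set_distance):
    """How well clustering c2 represents clustering c1, per distinct member."""
    # For every group in c1, take the distance to its closest group in c2.
    total = sum(min(d(s, sp) for sp in c2) for s in c1)
    members = set().union(*c1)  # distinct members of c1
    return total / len(members)


def best_match(c1, c2, d=set_distance):
    """Symmetric Best Match distance between two clusterings."""
    return (directed_distance(c1, c2, d) + directed_distance(c2, c1, d)) / 2
```

As a usage example, for `c1 = [{1, 2, 3}, {3, 4}]` and `c2 = [{1, 2}, {3, 4, 5}]` every group is one move away from its best match, so the two directed distances are $2/4$ and $2/5$ and the symmetric distance is their average.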


11 Enron Data 

The Enron email corpus consists of emails released by the U.S. Department of Justice during the
investigation of Enron. This data includes about 3.5 million emails sent from and to Enron employees
between 1998 and 2002. The mailboxes of approximately 150 employees constitute the Enron dataset.
Although the dataset contains emails related to thousands of Enron employees, the complete
information is only known for this smaller set of individuals. The corpus contains detailed
information about each email, including sender, recipient(s) (including To, CC, and BCC fields),
time, subject, and message body. We needed to transform this data into our standard input format
(sender, receiver, time). To accomplish this, for each message we generated multiple entries
(sender, receiver1, time), ..., (sender, receiverN, time), for all N recipients of the message.

12 Weblog (Blog) Data

This data set was constructed by observing Russian livejournal.com blogs. This site allows any user
to create a blog at no cost. The user may then submit text messages called posts onto their own
page. These posts can be viewed by anyone who visits their page. For the purposes of our analysis we
would like to know how information is being disseminated throughout this blog network. While data
about who accessed and read individual home pages is not available, there is other information with
which we can identify communications. When a user visits a home page, he or she may decide to leave
comments on one or more posts on the page. These comments are visible to everyone, and other users
may leave comments on comments, forming trees of communications rooted at each post. We then must
process this information into links of the form (sender, receiver, time). We make the following
assumptions for a comment by user $a$ at time $t$ in response to a comment written by user $b$, where
both comments pertain to a post written by user $c$: $a$ has read the original post of $c$, hence a
communication $(c, a, t)$ if this was the earliest comment $a$ made on this particular post; $c$
reads the comments that are made on his site, hence a communication $(a, c, t)$; $a$ read the comment
to which he is replying, hence the communication $(b, a, t)$; $b$ will monitor comments to his
comment, hence the communication $(a, b, t)$. Fig. 9 shows these assumed communications. Note that
the second post by user $a$ only generates a communication in one direction, since it is assumed that
user $a$ has already read the post by user $c$.

Figure 9: Communications inferred from weblog data.
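
The four assumptions translate directly into a rule that maps one comment to a list of (sender, receiver, time) links. The function name `comment_to_links` and the boolean flag `first_comment_by_a` are hypothetical names used only for this sketch.

```python
def comment_to_links(a, b, c, t, first_comment_by_a):
    """Links implied by a comment from `a` at time `t`, replying to a comment
    by `b`, under a post written by `c`."""
    links = []
    if first_comment_by_a:
        links.append((c, a, t))  # a has read c's original post
    links.append((a, c, t))      # c reads comments made on his own page
    links.append((b, a, t))      # a read the comment being replied to
    links.append((a, b, t))      # b monitors replies to his comment
    return links
```

A later comment by the same user on the same post is passed with `first_comment_by_a=False`, so the link from the post author to the commenter is generated only once.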

In addition to making comments, LiveJournal members may select other members to be in their
“friends” list. This may be represented by a graph where there is a directed edge from user $a$ to
user $b$ if $a$ selects $b$ as a friend. These friendships do not have times associated with them,
and so cannot be converted into communication data. However, this information can be used to
validate our algorithms, as demonstrated in the following experiment.

The friendship information may be used to verify the groups that have been discovered by our
algorithm. If the group is indeed a social group, the members should be more likely to select each
other as a friend than the members of a randomly selected group. The total number of members in the
friendship network is 2,551,488, with 53,241,753 friendship links among them, so about 0.0008
percent of all possible links are friendship links. Thus, we would expect about 0.0008 percent of the
possible links to be friendship links in a randomly selected group of LiveJournal members.
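
The quoted baseline can be checked with a short calculation, assuming that "all possible links" means ordered pairs of distinct members (the friendship graph is directed). The function name `link_density_percent` is hypothetical.

```python
def link_density_percent(members, links):
    """Percentage of all possible directed links that are present."""
    possible = members * (members - 1)  # ordered pairs of distinct members
    return 100 * links / possible


baseline = link_density_percent(2_551_488, 53_241_753)
print(round(baseline, 4))  # about 0.0008 percent
```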

13 Experimental Results 

13.1 Triples in Enron Email Data 

For our experiments we considered the Enron email corpus (see Section 11). We took $T_{min}$ to be
1 hour and $T_{max}$ to be 1 day. Fig. 10 compares the number of triples occurring in the data to
the number that occur randomly in the synthetically generated data using the model derived from the
Enron data. As can be observed, the number of triples in the data by far exceeds the number of
random triples. After some frequency threshold, no random triples of higher frequency appear, i.e.
all the triples appearing in the data at this frequency are significant. We used $M = 1000$ data sets
to determine the random triple curve in Fig. 10.

The significance thresholds we discover indicate that the probability of a triple occurring at
random above the thresholds is in practice very close to zero. In other words, the observed
probability $B$ of a random triple occurring above the specified threshold is 0; however, the true
probability $T$ of a random triple occurring above the threshold is not 0. Thus, to put a bound on
the true probability $T$, we use the Chernoff bound: $P(T < \epsilon) \ge 1 - e^{-2n\epsilon^2}$, where
$n$ is the number of random sets we generated ($M = 1000$) and $\epsilon$ is an error tolerance.
Setting $\epsilon = 0.05$, we have that $P(T < 0.05) \ge 0.9933$.
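
The quoted value can be reproduced numerically under the assumption that the bound has the form $1 - e^{-2n\epsilon^2}$; with $n = 1000$ and $\epsilon = 0.05$ the exponent is $-5$. The function name `confidence_lower_bound` is hypothetical.

```python
import math


def confidence_lower_bound(n, eps):
    """Lower bound 1 - exp(-2 * n * eps^2) on P(T < eps), assumed form."""
    return 1 - math.exp(-2 * n * eps ** 2)


print(round(confidence_lower_bound(1000, 0.05), 4))  # 0.9933
```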




Figure 10: Abundance of triples occurring as a function of frequency of occurrence: (a) chain
triples; (b) sibling triples.





Figure 11: Evolution of part of the Enron organizational structure from 2000 - 2002. Note: actors
B, C, D and F are present in all three intervals. They are: B - T. Brogan, C - Peggy Heeg,
D - Ajaj Jagsi and F - Thresa Allen.


13.2 Experiments on Weblog Data 

Similar experiments were run on the Weblog data to obtain communication groups (see Section 12 for
a description of the Weblog data). As a validation we used a graph of friendship links, which was
constructed from friendship lists of people who participated in the conversations during that
period. Fig. 12 shows one of the groups found in the Weblog data and the corresponding friendship
links between the people who participated in that group. The fraction of friendship links for this
group of 24 actors is 2.5%, again well above the 0.0008% for a randomly chosen group of 24 actors.

Figure 12: Validation of Weblog group communicational structure on the left against actual
friendship links on the right.


13.3 General Scoring Functions vs. “Step” Function Comparison 

Here we present a comparison of general scoring functions and a “step” function. We compare the
given general propagation delay functions $G_1$, $G_2$, $G_3$ and $G_4$ to a “best fit” step
function $H$, see Figure 13.




Figure 13: Step function $H$ and a General Response Function $G_1$ for 2D Matching on the left and
Exponential Decay Response Function $G_4$ on the right.

Functions $G_2$ and $G_3$ are respectively linear monotonically increasing and linear monotonically
decreasing functions, while $G_4$ is generated using a well known exponential distribution of the
form $y = \lambda e^{-\lambda x}$. We generated $G_1$ using a cubic spline interpolation.

For the purposes of this experiment we used an Enron dataset, where we looked at the data which
represents approximately one year of Enron communications and consists of 753,000 messages. We
obtained a set of triples $H$ for the step function, sets of triples $G_1$, $G_2$, $G_3$ and $G_4$
for the general propagation functions with the causality constraint, and $G'_1$, $G'_2$, $G'_3$ and
$G'_4$ without the causality constraint. Next we used our distance measure algorithms to measure the
relative distance between these graphs. Figure 14 shows the discovered relative distances.

The results indicate that functions $H$, $G_3$ and $G_4$ produce very similar sets of triples.
This is explained by the fact that most of the captured triples occur “early” and are therefore
discovered by these somewhat similar functions. We also notice that $G_1$ and $G_2$ find different
sets of triples, while $G_1$ still has a significant overlap with $H$. We can explain this
behavior by the fact that $G_1$ and $G_2$ have peaks in different time intervals and thus capture
triples occurring in those intervals.

           $H$     $G_1$   $G_2$   $G_3$   $G_4$
$G_1$      0.63    -       0.34    0.66    0.67
$G'_1$     0.64    0.98    0.34    0.65    0.67
$G_2$      0.22    0.34    -       0.33    0.18
$G'_2$     0.23    0.34    0.97    0.34    0.18
$G_3$      0.94    0.66    0.33    -       0.88
$G'_3$     0.95    0.67    0.34    0.96    0.89
$G_4$      0.90    0.67    0.18    0.88    -
$G'_4$     0.92    0.67    0.19    0.89    0.97

Figure 14: Relative similarity between the groups of $H$, the $G$s and the $G'$s.
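
To make the notion of overlap between two sets of discovered triples concrete, the following sketch uses the Jaccard index as a stand-in. It is not the distance measure used to produce Figure 14, which is defined elsewhere in this work; the function name `jaccard` and the triples are invented for illustration.

```python
def jaccard(triples_a, triples_b):
    """Jaccard index |A & B| / |A | B| of two collections of triples."""
    a, b = set(triples_a), set(triples_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical triples found by two scoring functions with different peaks.
found_early = {("a", "b", "c"), ("b", "c", "d"), ("c", "d", "e")}
found_late = {("b", "c", "d"), ("x", "y", "z")}

# One shared triple out of four distinct triples overall.
print(jaccard(found_early, found_late))
```

A value near 1 would correspond to scoring functions that capture essentially the same triples, while a value near 0 would correspond to functions whose peaks pick up different delays.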


In the current setting we showed that functions with peaks at different points discover
different triples. In most real data there seems to be no practical need for this added
generality, although having this ability at hand may prove useful in certain settings. Moreover,
since the difference is small compared to the rate of group change in the Enron data, a general
propagation delay function adds too little value to justify the increase in computation cost from
linear to quadratic time.

13.4 Tracking the Evolution of Hidden Groups 

The significance threshold frequencies were $\kappa_{chain} = 30$ for chains and
$\kappa_{sibling} = 160$ for siblings. We used a sliding window of one year to obtain evolving
hidden groups. On each window we obtained the significant chains and siblings (frequency >
$\kappa$) and the clusters in the corresponding weighted overlap graph. We use the clusters to
build the communication structures and show the evolution of one of the hidden groups in Fig. 11
without relying on any semantic message information. The key person in this hidden group is actor
C, who is Peggy Heeg, Senior Vice President of El Paso Corporation. El Paso Corporation was often
partnered with ENRON and was accused of raising prices to a record high during the “blackout”
period in California.
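
The per-window significance filter described above can be sketched as follows. This is a minimal illustration under stated assumptions: triple occurrences are taken as already extracted, the window is treated as a half-open interval, and the name `significant_triples` and the toy data are hypothetical. The strict comparison follows the "frequency > threshold" condition in the text.

```python
from collections import Counter


def significant_triples(occurrences, window, threshold):
    """Return triples observed more than `threshold` times inside `window`.

    occurrences: iterable of (time, triple) pairs, one per observed triple.
    window:      (start, end) pair, treated as the half-open interval [start, end).
    """
    start, end = window
    counts = Counter(triple for time, triple in occurrences if start <= time < end)
    return {triple for triple, count in counts.items() if count > threshold}


# Toy data: times are abstract integers, triples are (sender, relay, receiver).
occurrences = (
    [(t, ("A", "B", "C")) for t in range(5)]          # 5 occurrences, times 0..4
    + [(t, ("A", "B", "D")) for t in range(2)]        # 2 occurrences, times 0..1
    + [(t, ("C", "D", "E")) for t in range(10, 14)]   # 4 occurrences, times 10..13
)

print(significant_triples(occurrences, (0, 10), 3))
print(significant_triples(occurrences, (5, 15), 3))
```

Sliding the window forward and repeating the filter yields one set of significant triples per window, which is the input to the clustering step.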

13.5 Estimating the Rate of Change for Coalitions in the Blogosphere 

Next we show how the distance measures presented in this work can be used to track the evolution
and estimate the rate of change of clusterings and groups over time. As our example we studied the
social network of the Blogosphere (Live Journal). We found four clusterings $C_1$, $C_2$, $C_3$
and $C_4$ by analyzing the same social network at different times. Each consecutive clustering was
constructed one week later than the previous one. The task of this experiment is to find the
amount of change that happened in this social network over the period of four weeks. The sizes of
the clusterings $C_1$, $C_2$, $C_3$ and $C_4$ are 81348, 82056, 82132 and 80217 respectively,
while the average densities are 0.630, 0.643, 0.621 and 0.648.
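
As a rough illustration of how two consecutive clusterings might be compared, the sketch below implements one plausible best-match style distance. It is an assumption, not the definition used in this work: each group is matched to its closest group in the other clustering under the size of the symmetric difference, the per-group distances are averaged, and the two directions are then averaged. The name `best_match_distance` and the toy clusterings are hypothetical.

```python
def best_match_distance(clustering_a, clustering_b):
    """Symmetrized average distance from each group to its closest counterpart.

    Each clustering is a non-empty list of frozensets; groups may overlap.
    The distance between two groups is the size of their symmetric difference.
    """
    def one_way(source, target):
        return sum(min(len(g ^ h) for h in target) for g in source) / len(source)

    return 0.5 * (one_way(clustering_a, clustering_b) + one_way(clustering_b, clustering_a))


# Toy clusterings one "week" apart: one group swaps a member, one is unchanged.
week_1 = [frozenset("abc"), frozenset("de")]
week_2 = [frozenset("abd"), frozenset("de")]

print(best_match_distance(week_1, week_2))
print(best_match_distance(week_1, week_1))
```

Applying such a distance to each consecutive pair of clusterings and averaging the results gives a single rate-of-change figure per network.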

We can see in Fig. 15 that the Best Match and the K-center algorithms imply that the rate of
change of groups in the Blogosphere is relatively high: the groups change very dynamically from
one week to another.

              $C_1$-$C_2$   $C_2$-$C_3$   $C_3$-$C_4$   Average Change
Best Match    4.31          5.01          4.83          4.72

Figure 15: The rate of change of the clusterings in the Blogosphere over the period of four weeks.




              $C'_1$-$C'_2$   $C'_2$-$C'_3$   $C'_3$-$C'_4$   Avg. Change
Best Match    0.3             0.23            0.24            0.26

Figure 16: The rate of change of the clusterings in the Enron organizational structure from 2000
to 2002.

13.6 Estimating the Rate of Change for Groups in the Enron Organizational Structure

Another experiment we conducted using the proposed distance measures estimates the rate of change
for groups in the Enron organizational structure. We used the Enron email corpus and the approach
proposed in [6] to obtain clusterings with statistically significant persistent groups in several
different time intervals.

On each window we first obtained the significant chains and siblings and then the clusterings
in the corresponding weighted overlap graph. The clusterings $C'_1$, $C'_2$, $C'_3$ and $C'_4$,
with average densities 0.65, 0.7, 0.71 and 0.67 respectively, were computed on the intervals
Sept. 1999 - Sept. 2000, Mar. 2000 - Mar. 2001, Sept. 2000 - Sept. 2001 and Mar. 2001 - Mar. 2002.
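
The four intervals listed above are one-year windows shifted by six months. A small sketch that generates them is given below; the helper name `month_windows` and the (year, month) representation are illustrative choices, not part of the original experiments.

```python
def month_windows(first, count, length=12, step=6):
    """Generate `count` windows of `length` months, each shifted by `step` months.

    first: (year, month) of the start of the first window, with month in 1..12.
    Returns a list of ((start_year, start_month), (end_year, end_month)) pairs.
    """
    def shift(year_month, months):
        year, month = year_month
        index = year * 12 + (month - 1) + months
        return (index // 12, index % 12 + 1)

    return [
        (shift(first, i * step), shift(first, i * step + length))
        for i in range(count)
    ]


# Windows starting in September 1999, matching the intervals listed in the text.
for start, end in month_windows((1999, 9), 4):
    print(start, "-", end)
```

Because consecutive windows share six months of data, some persistence between consecutive clusterings is expected by construction.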

Next we used the Best Match and K-center algorithms to track the rate of change in the network.
Fig. 16 illustrates the rate of change as well as the average rate of change of the clusterings.
Notice that the rate of change in the email network over a 6-month period is significantly lower
than the rate of change in Blogs over a 1-week period. Blogs are a significantly more dynamic
social network, which should be no surprise.

Fig. 11 illustrates the structure of a single group in each of the clusterings and gives a sense
of its evolution from one time interval to the next.

The ability to account for the overlap and evolution dynamics is the underlying reason why 
the distance found by the Best Match and K-center algorithms is relatively low for groups in the 
ENRON dataset. 

In conclusion, Blogs and ENRON are two very different social networks. ENRON is a company
network with an underlying hierarchy of command, which is unlikely to change quickly over time,
while the Blogosphere is a much more dynamic social network, where groups and their memberships
can change rapidly. This behavior is well reflected in the experiments described above.

13.7 Tree Mining Validation 

Additionally, for the purpose of validation, we used the tree mining approach in conjunction with
the heuristic algorithms in order to verify that a discovered tree-like structure actually occurs
frequently in the data. For the experiment, we once again used the ENRON email corpus and the
time intervals Sept. 1999 - Sept. 2000, Mar. 2000 - Mar. 2001, Sept. 2000 - Sept. 2001 and Mar.
2001 - Mar. 2002. For each interval we found the significant chains and siblings and performed the
clustering on the weighted graph of overlapping triples, obtaining the clusterings $C_1$, $C_2$,
$C_3$ and $C_4$. Next we performed tree mining in order to extract exact tree-like communication
patterns for the same intervals and obtained $T_1$, $T_2$, $T_3$ and $T_4$. The same significance
threshold frequencies, $\kappa_{chain} = 35$ and $\kappa_{sibling} = 160$, were used as when we
found $C_1$, $C_2$, $C_3$ and $C_4$.

              $C_1$-$T_1$   $C_2$-$T_2$   $C_3$-$T_3$   $C_4$-$T_4$
Best Match    0.323         0.321         0.294         0.389

Figure 17: The similarity between the trees and the clusterings in the Enron organizational
structure from 2000 to 2002.



              $C'_1$-$T'_1$   $C'_2$-$T'_2$   $C'_3$-$T'_3$   $C'_4$-$T'_4$
Best Match    0.411           0.407           0.414           0.41

Figure 18: The similarity between the trees and the clusterings in the Blogosphere over the period
of 4 weeks.

We also performed the same experiment on the Blogosphere, where we randomly picked a set of 4
consecutive weeks and discovered groups with our heuristic clustering approach to obtain the
clusterings $C'_1$, $C'_2$, $C'_3$ and $C'_4$. Next we found the exact tree-like communication
structures $T'_1$, $T'_2$, $T'_3$ and $T'_4$ for each week respectively.

We used the Best Match and the K-center algorithms to measure the similarity between these two
sets. The results of these measurements are shown in Figs. 17 and 18. The groups which we find
using a heuristic clustering approach compare well to the actual tree-like structures present in
the data.

Additionally, Figs. 15 and 18 show that, despite the rapid and dynamic rate of change in the
Blogosphere as a system, the relative distance found between the respective $T$s and $C$s remained
low. This suggests that our algorithms for discovering hidden planning groups perform well both
for very dynamic systems such as the Blogosphere and for more stable systems such as ENRON. The
slightly higher similarity in the ENRON data could be caused by the fact that the underlying
hierarchical structure of the ENRON company resembles tree-like patterns much more than the
chaotic Blogosphere does. Nevertheless, the similarity discovered for the groups in the
Blogosphere data still suggests that the groups we discover using our heuristic approach are
similar in nature to the groups discovered by tree mining. Thus this section provides further
evidence that our algorithms find real and meaningful groups in streaming communication data
without using message content.

14 Conclusions 

In this work, we described algorithms for discovering hidden groups based only on communication 
data. The structure imposed by the need to plan was a very general one, namely connectivity. 
Connectivity should be a minimum requirement for the planning to take place, and perhaps adding 
further constraints can increase the accuracy or the efficiency. 

In our algorithms there is no fixed communication cycle and the group’s planning waves of 
communications may overlap. The algorithm first finds statistically significant chain and sibling 
triples. Using a heuristic to build from triples, we find hidden groups of larger sizes. Using a moving 
window and matching algorithms we can track the evolution of the organizational structure as well 
as hidden group membership. Using a tree querying algorithm one can query a hierarchical structure 
to check if it exists in the data. The tree mining algorithm finds exactly all of the frequent trees 
and can be used for verification purposes. Our statistical algorithms serve to narrow down the set 
of possible hidden groups that need to be analyzed further. 

We validated our algorithms on real data and our results indicate that the hidden group
algorithms do indeed find meaningful groups. Our algorithms do not use communication content and
do not differentiate between the natures of the hidden groups discovered; for example, some of the
hidden groups may be malicious and some may not. The groups found by our algorithms can be
studied further by taking into account the form and the content of each communication, to get a
better overall result and to identify the truly suspicious groups.